Safety Evaluation

Back

The section discusses the evaluation of the safety of LLMs (Large Language Models). Two main categories of evaluation are identified: robustness evaluation and risk evaluation.

Robustness Evaluation: The robustness of LLMs is crucial for the development of models with stable performance. The robustness evaluation of LLMs is divided into three categories: prompt robustness, task robustness, and alignment robustness.

Prompt Robustness: PromptBench is proposed as a benchmark for evaluating the robustness of LLMs by attacking them with adversarial prompts. It evaluates eight different NLP tasks using dynamically created character-, word-, sentence-, and semantic-level prompts. Additionally, the robustness of LLMs in handling prompt typos is evaluated using prompts from the Justice dataset.
Task Robustness: Various studies evaluate the robustness of LLMs across different NLP tasks:
- ChatGPT is evaluated for translation, question-answering, text classification, and natural language inference using benchmark datasets.
- Translation task robustness is evaluated using the WMT datasets, which contain naturally occurring noises and domain-specific terminology words.
- Question-answering task robustness is evaluated using datasets that contain table headers, table content, natural language questions (NLQ), and various perturbations.
- Text classification task robustness is evaluated using synthetic datasets generated by SynTextBench.
- Classification tasks using Japanese language datasets are also evaluated.

Furthermore, the robustness of LLMs is evaluated for code generation, mathematical reasoning, and dialogue generation tasks. Benchmarks are proposed for each task to evaluate the robustness of LLMs in these areas.

Alignment Robustness: The alignment robustness of LLMs is evaluated to ensure the stability of alignment towards human values. Jailbreak methods are used to attack LLMs and generate harmful or unsafe behavior and content. Benchmarks and frameworks, such as “MasterKey,” are proposed to evaluate the alignment robustness of LLMs.

Risk Evaluation: As LLMs approach or reach human-level capabilities, there is a need to evaluate their safety to identify and mitigate catastrophic risks. Risk evaluation can be done by discovering risky behaviors of LLMs and evaluating LLMs as agents.

Evaluating LLMs Behaviors: Multiple studies aim to discover risky behaviors of LLMs by constructing datasets and evaluating LLMs’ behaviors:
- Categories of risks are defined, such as instrumental subgoals, myopia, situational awareness, willingness to coordinate with other AIs, and decision theory, to generate multiple-choice questions and evaluate the risks of LLMs.
- Mistakes and logical errors of LLMs are discovered in decision-making tasks, future event prediction, legal judgment, social reasoning, and causal inference tasks.
- The cooperativeness of LLMs is evaluated in high-stakes interactions with other agents.
Evaluating LLMs as Agents: LLMs’ abilities as agents in interactive environments are evaluated through benchmarks and sandbox environments. These evaluations assess the reasoning and decision-making abilities of LLMs and their ability to solve complex tasks in realistic and controlled environments. The evaluation of LLMs’ abilities as agents is still in its early stages.

In summary, the evaluation of the safety of LLMs is approached through robustness evaluation, which assesses the stability of LLMs, and risk evaluation, which identifies and mitigates potential catastrophic risks associated with LLMs’ behaviors and capabilities.

Words: 501